Notes:
library(tidyverse)
## ── Attaching packages ───────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
pf<-read.csv('pseudo_facebook.tsv',sep='\t')
ggplot(data=pf,aes(x=age,y=friend_count))+
geom_point()
Response: There are vertical striations at specific ages, and younger users tend to have more friends than older users.
Notes:
ggplot(data=pf,aes(x=age,y=friend_count))+
geom_point()+
xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).
Notes:
ggplot(data=pf,aes(x=age,y=friend_count))+
geom_jitter(alpha=1/20)+
xlim(13,90)
## Warning: Removed 5173 rows containing missing values (geom_point).
Response: The bulk of young users have fewer than 1,000 friends.
Notes:
friendscatter<-ggplot(data=pf,aes(x=age,y=friend_count))+
geom_point(alpha=1/20,color="orange")+
xlim(13,90)+
coord_trans(y="sqrt")
friendscatter
## Warning: Removed 4906 rows containing missing values (geom_point).
Explore the relationship between friendships initiated and age.
Notes:
ggplot(data=pf,aes(x=age,y=friendships_initiated))+
geom_point(alpha=1/20,position=position_jitter(height=0))+
xlim(13,90)+
coord_trans(y="sqrt")
## Warning: Removed 5176 rows containing missing values (geom_point).
Notes: The number of friends who see a post makes more sense when it is expressed as a percentage of the user's total friend count.
Notes:
# age_groups<-group_by(pf,age)
# pf.fc_by_age<-summarise(age_groups,
# friend_count_mean=mean(friend_count),
# friend_count_median=median(friend_count),
# n=n())
# pf.fc_by_age<-arrange(pf.fc_by_age,age)
#
# head(pf.fc_by_age)
pf.fc_by_age<- pf%>%
group_by(age)%>%
summarise(friend_count_mean=mean(friend_count),
friend_count_median=median(friend_count),
n=n())%>%
arrange(age)
head(pf.fc_by_age)
## # A tibble: 6 x 4
## age friend_count_mean friend_count_median n
## <int> <dbl> <dbl> <int>
## 1 13 165 74.0 484
## 2 14 251 132 1925
## 3 15 348 161 2618
## 4 16 352 172 3086
## 5 17 350 156 3283
## 6 18 331 162 5196
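A quick check of the piped summary on a made-up data frame. The `toy` data and its values are invented for illustration and are not taken from pseudo_facebook.tsv. The nested form (as in the commented-out code) and the piped form should produce the same table.

```r
library(dplyr)

# Hypothetical toy data, not taken from pseudo_facebook.tsv
toy <- data.frame(age = c(14, 13, 14, 13, 14),
                  friend_count = c(5, 10, 15, 30, 100))

# Nested form: read from the inside out
nested <- arrange(
  summarise(group_by(toy, age),
            friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()),
  age)

# Piped form: read from top to bottom
piped <- toy %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n = n()) %>%
  arrange(age)

print(piped)
print(all.equal(as.data.frame(nested), as.data.frame(piped)))
```

The pipe only changes how the steps are written, so it is mostly a readability choice.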
Create your plot!
friendline<-ggplot(data=pf.fc_by_age,aes(x=age,y=friend_count_mean))+
geom_line()
friendline
Notes:
ggplot(data=pf,aes(x=age,y=friend_count))+
geom_point(alpha=1/20,color="orange")+
geom_line(stat="summary",fun.y=mean)+
geom_line(stat="summary",fun.y=quantile,fun.args=list(probs=.1),
linetype=2,color="blue")+
geom_line(stat="summary",fun.y=quantile,fun.args=list(probs=.5),
color="blue")+
geom_line(stat="summary",fun.y=quantile,fun.args=list(probs=.9),
linetype=2,color="blue")+
coord_cartesian(xlim=c(13,70),ylim=c(0,1000))
#### What are some of your observations of the plot?
Response:
Note: ggplot2 2.0.0 changed the syntax for passing parameters to functions when using stat = 'summary'. To set parameters on the function specified by fun.y, use the fun.args argument, e.g.:
ggplot( … ) + geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9), … )
To zoom in, the code should use the coord_cartesian(xlim = c(13, 90)) layer rather than the xlim(13, 90) layer.
Look up documentation for coord_cartesian() and quantile() if you’re unfamiliar with them.
See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.
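Why coord_cartesian() is preferred for zooming when summary lines are drawn: scale limits such as xlim() and ylim() discard out-of-range observations before the statistic is computed, whereas coord_cartesian() only changes the visible window. The effect is easiest to see with the y limit of 1000 used in the plot above. The base R sketch below mimics both behaviours by hand on invented numbers; it does not call ggplot2 itself.

```r
# Invented toy data: one user at age 20 has far more than 1000 friends
toy <- data.frame(age = c(20, 20, 20, 30, 30, 30),
                  friend_count = c(10, 20, 1500, 5, 15, 25))

# What a summary sees under coord_cartesian(ylim = c(0, 1000)): every row
mean_zoomed <- tapply(toy$friend_count, toy$age, mean)

# What a summary sees under a scale limit such as ylim(0, 1000):
# rows above 1000 become missing and are dropped before the mean is taken
kept <- toy$friend_count <= 1000
mean_censored <- tapply(toy$friend_count[kept], toy$age[kept], mean)

print(rbind(mean_zoomed, mean_censored))
```

The two rows differ only for the age group that contains the out-of-range value, which is exactly where a scale limit would silently change the plotted mean.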
Notes:
cor.test(pf$age,pf$friend_count,method="pearson")
##
## Pearson's product-moment correlation
##
## data: pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
with(pf,cor.test(age,friend_count,method="pearson"))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03363072 -0.02118189
## sample estimates:
## cor
## -0.02740737
Look up the documentation for the cor.test function.
What’s the correlation between age and friend count? Round to three decimal places. Response: -0.027
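To see what cor.test() reports, the sketch below recomputes Pearson's r and the t statistic by hand on simulated data. The variables `x` and `y`, the seed, and the slope of -0.3 are arbitrary choices made for illustration and have nothing to do with the Facebook data.

```r
set.seed(42)  # arbitrary seed so the run is repeatable
x <- rnorm(200)
y <- -0.3 * x + rnorm(200)

# Pearson's r: the covariance scaled by both standard deviations
r_by_hand <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The t statistic follows from r and the degrees of freedom (n - 2)
df <- length(x) - 2
t_by_hand <- r_by_hand * sqrt(df / (1 - r_by_hand^2))

result <- cor.test(x, y, method = "pearson")
print(c(by_hand = r_by_hand, cor_test = unname(result$estimate)))
print(c(by_hand = t_by_hand, cor_test = unname(result$statistic)))
```

The same relation ties together the output above: with r = -0.02740737 and df = 99001, r * sqrt(df / (1 - r^2)) comes to about -8.627, in line with the reported t = -8.6268. A tiny correlation can still be highly significant when n is this large.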
Notes:
# method must be passed to cor.test(), not to with(), which silently ignores it
with(subset(pf,age<=70), cor.test(age, friend_count, method='pearson'))
##
## Pearson's product-moment correlation
##
## data: age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1780220 -0.1654129
## sample estimates:
## cor
## -0.1717245
Notes: The Pearson correlation evaluates the linear relationship between two continuous variables. … The Spearman correlation coefficient is based on the ranked values for each variable rather than the raw data. Spearman correlation is often used to evaluate relationships involving ordinal variables. Pearson is far too sensitive to influential points/outliers for my taste, and while Spearman doesn’t suffer from this problem, I personally find Kendall easier to understand, interpret and explain than Spearman.
If, for example, one variable is the identity of a college basketball program and another variable is the identity of a college football program, one could test for a relationship between the poll rankings of the two types of program: do colleges with a higher-ranked basketball program tend to have a higher-ranked football program? A rank correlation coefficient can measure that relationship, and the measure of significance of the rank correlation coefficient can show whether the measured relationship is small enough to likely be a coincidence.
Summary: Pearson suits roughly normal, linearly related data; Spearman is more robust to outliers.
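A small sketch of the outlier sensitivity described above, using invented data: `y` increases perfectly with `x` except for one extreme final point. All three coefficients come from base R's cor().

```r
x <- 1:10
y <- c(1:9, -100)  # invented data: monotone increasing except one extreme point

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

print(round(c(pearson = pearson, spearman = spearman, kendall = kendall), 3))
```

Here the single outlier is enough to make Pearson's r negative, while the two rank-based coefficients stay positive because nine of the ten points still rise together.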
Notes:
ggplot(data=pf,aes(x=www_likes_received,y=likes_received))+
geom_point()+
coord_trans(x='sqrt',y='sqrt')
Notes:
ggplot(data=pf,aes(x=www_likes_received,y=likes_received))+
geom_point()+
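# NOTE: this x limit is taken from likes_received (the y variable); www_likes_received may have been intended
xlim(0,quantile(pf$likes_received,.95))+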
ylim(0,quantile(pf$likes_received,.95))+
geom_smooth(method ='lm',color='red')
## Warning: Removed 4936 rows containing non-finite values (stat_smooth).
## Warning: Removed 4936 rows containing missing values (geom_point).
## Warning: Removed 31 rows containing missing values (geom_smooth).
What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.
with(pf, cor.test(likes_received,www_likes_received,method='pearson'))
##
## Pearson's product-moment correlation
##
## data: likes_received and www_likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9473553 0.9486176
## sample estimates:
## cor
## 0.9479902
Response: 0.948
Notes:
# install.packages('alr3')
library(alr3)
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
data(Mitchell)
?Mitchell
Create your plot!
ggplot(data=Mitchell,aes(x=Month,y=Temp))+
geom_point()
with(Mitchell,cor.test(Month,Temp,method='pearson'))
##
## Pearson's product-moment correlation
##
## data: Month and Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.08053637 0.19331562
## sample estimates:
## cor
## 0.05747063
Notes:
What do you notice? Response:
Watch the solution video and check out the Instructor Notes! Notes:
ggplot(data=Mitchell,aes(x=Month,y=Temp))+
geom_point()+
scale_x_continuous(breaks=seq(0,203,12))
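Putting breaks every 12 months hints at a yearly cycle, which a linear correlation coefficient cannot pick up. The sketch below uses a simulated stand-in, not the Mitchell data: 204 months of a pure 12-month sine wave with an arbitrary amplitude of 10. It compares the overall correlation with the pattern that appears once months are folded onto the calendar year.

```r
# Simulated stand-in for a monthly temperature series (NOT the Mitchell data)
month <- 0:203
temp <- 10 * sin(2 * pi * month / 12)

# Linear correlation over the whole series is close to zero
r_overall <- cor(month, temp)

# Folding months onto a 12-month cycle exposes the seasonal pattern
month_of_year <- month %% 12
cycle_means <- tapply(temp, month_of_year, mean)

print(r_overall)
print(round(cycle_means, 2))
```

A near-zero r here means "no linear trend", not "no relationship". The same folding with `Month %% 12` could be tried on the real data.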
p1<-ggplot(data=subset(pf.fc_by_age,age<71),
aes(x=age,y=friend_count_mean))+
geom_line()
p1
pf%>%
mutate(age_with_months=age+(1-dob_month/12))->pf
# pf$age_with_months <- with(pf, age + (1 - dob_month / 12))
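A worked example of the age_with_months formula on three hypothetical users who share the same age in years. As I read the formula, it treats a December birthday as having just happened and a January birthday as eleven months past, which amounts to measuring age at the end of the year; that reading is an assumption, not something the data set documents here.

```r
# Hypothetical users: same age in years, different birth months
age <- c(36, 36, 36)
dob_month <- c(1, 6, 12)

# Same formula as above: add the fraction of the year since the birthday
age_with_months <- age + (1 - dob_month / 12)

print(data.frame(age, dob_month, age_with_months))
```

Each age in years therefore spreads over up to twelve distinct values, which is why the monthly summary has many more (and much smaller) groups than the yearly one.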
Programming Assignment
pf.fc_by_age_months<- pf%>%
group_by(age_with_months)%>%
summarise(friend_count_mean=mean(friend_count),
friend_count_median=median(friend_count),
n=n())%>%
arrange(age_with_months)
p2<-ggplot(data=subset(pf.fc_by_age_months,age_with_months<71),
aes(x=age_with_months,y=friend_count_mean))+
geom_line()
p2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(p2,p1,ncol=1)
Notes:
p1<-p1+geom_smooth()
p2<-p2+geom_smooth()
p3<-ggplot(data=subset(pf,age<71),
aes(x=round(age/5)*5,y=friend_count))+
geom_line(stat='summary',fun.y=mean)
grid.arrange(p2,p1,p3,ncol=1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'
Notes: Which plot to use depends on the purpose; sometimes you need all of them!
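The third plot bins age to the nearest multiple of 5 with round(age/5)*5 before averaging. The sketch below applies the same expression to a few invented ages and friend counts to show which ages end up sharing a bin.

```r
# Invented values for illustration only
ages <- c(13, 14, 17, 18, 22, 23, 27)
friend_count <- c(100, 200, 300, 400, 500, 600, 700)

# Same binning expression as in the plot: nearest multiple of 5
age_bin <- round(ages / 5) * 5

# Mean friend count per bin, as geom_line(stat = 'summary') would plot it
bin_means <- tapply(friend_count, age_bin, mean)

print(data.frame(ages, age_bin))
print(bin_means)
```

Wider bins average over more users, so the line is smoother but hides detail such as spikes at individual ages; narrower bins do the opposite.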
Reflection: Scatter plots can reveal relationships that a correlation test misses, jitter and alpha improve a graph's readability, and binning or smoothing can dramatically change the look of a graph.
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!